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Abstract 

In this paper we study a bootstrap strategy for estimating the variance 
of a mean taken over large multifactor crossed random effects data sets. 
We apply bootstrap reweighting independently to the levels of each fac- 
fjQ tor, giving each observation the product of independently sampled factor 



X 



weights. No exact bootstrap exists for this problem (McCullagh 20001 
<^ We show that the proposed bootstrap is mildly conservative, meaning bi- 

+J ased towards overestimating the variance, under sufficient conditions that 

allow very unbalanced and heteroscedastic inputs. Earlier results for a re- 
sampling bootstrap only apply to two factors and use multinomial weights 
that are poorly suited to online computation. The proposed reweighting 
approach can be implemented in parallel and online settings. The results 
^ for this method apply to any number of factors. The method is illustrated 

ly-^ using a 3 factor data set of comment lengths from Facebook. 

Keywords: Bayesian pigeonhole bootstrap, online bagging, online bootstrap, 
relational data, tensor data, unbalanced random effects 

1 Introduction 

. . Large sparse data sets with two or more crossed random effects commonly arise 

from electronic commerce and Internet services, and we may expect them to 
arise in other settings as automated data gathering becomes more prevalent. 
$J] Such data often have no IID structure for us to draw on. For example, with 

the famous Netflix data ( Bennett and Lanning 2007 1 multiple ratings from the 



same viewer are dependent. Similarly multiple ratings on the same movie are 
dependent. Neither rows nor columns are IID, and a crossed random effects 
model with interactions is a more reasonable structure. 

In Internet data there can easily be more than two crossed factors. The 
individual factors could be user account numbers, IP addresses, URLs, search 
query strings or identifiers for documents placed in web pages. The response 
variable might be a measure of user engagement such as time spent reading, or 
system performance such as the load times for pages under different versions of 
software. 
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The crossed random effects setting is challenging for inference. Methods 



Searle et al. ( 1992 ) rely on Gaussian data assumptions and outside of balanced 



cases, the necessary linear algebra becomes prohibitively expensive on large 
problems. 

We might therefore turn to resampling. For IID data, the bootstrap provides 
reliable variance estimates and confidence intervals under very weak assump- 
tions on the mechanism generating our data. But McCullagh (2000) proved 
that there does not exist an exact bootstrap algorithm for crossed random ef- 



fects. Specifically if X. 
variables 



bj and 



y — fj, + a, + bj + 
with variances a 2 A , 



Eij for independent mean random 
a\ and a 2 E respectively, then no re- 
sampling method, from a very broad class, will provide an unbiased estimate of 
VarKIJ)- 1 !)^ £/=!*«)■ 

One approach to bootstrapping crossed data is to independently bootstrap 
the indices of each factor. In bootstrapping a factor we are putting a random 
multinomially distributed weight on the levels of that factor. For an r-fold data 
set, the observation X ili2 ... ir gets a weight W*^... lr = J]j=i W$ where W$ 
is the weight on level ij of the j'th factor in the &'th bootstrap reweighting. For 
each j and each b, weight vectors (W* b ,W*^, . . . , W* b N .) are sampled indepen- 
dently. Given weights on all the data, we compute a weighted version of the 
statistic(s) of interest to get the 6'th bootstrap value. 

Bootstrapping with a product of multinomial weights has been studied be- 
fore, 



for 



2. Brennan et al. (1987) and Wiley (2001 1 use it to study variance 



components in educational test data. McCullagh (2000) shows that indepen- 
dently bootstrapping the rows and columns of a data matrix gives a mildly 
conservative estimate of variance. That is, it has a positive bias that is usu- 
ally relatively small. McCullagh (2000) considered balanced crossed random 



effects (no missing values) with homoscedastic variance components. Owen 



( 2007 ) shows that this bootstrap remains conservative (and usually mildly so) for 
sparsely sampled unbalanced crossed random effects allowing for heteroscedas- 
ticity. That framework allows every row and column (e.g. customer and movie) 
and even every interaction to have its own variance. Resampling is then reliable 
and it spares the analyst from having to estimate all of those variances. 

The random weighting that we favor is a product of completely indepen- 



dent weights: W i 



* h 



n 



. w* b 



where for each b and each j, WX. are 



IID weights with mean 1 and variance 1. For these large data sets, meth- 



ods that reweight data via IID random weights (Rubin 1981[ Newton and 



Raftery 1994) are an appealing alternative to the multinomial weights used 
in resampling. First, it is simpler to apply independent reweighting to large 



scale parallelized computations, as is done in online bagging and boosting (Oza 



2001 Lee and Clyde 2004). The reason is that large data sets are stored in 



a distributed fashion and then multinomial sampling brings substantial com- 
munication and synchronization costs. Second, resampling simplifies variance 
expressions by avoiding the negative dependence from the multinomial distribu- 
tion. This makes it easier to develop expressions for problems with more than 
two factors. 

Using notation and approximations defined below, the main facts are as 
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follows. For r = 2 factors, we suppose the data are sampled by a random effects 
model with variance components cr^j, <r| 2 | , and 2 j corresponding to the 
main effects and interaction respectively. We can express the variance of the 
sample average of N observations in the form (^{i}0m + v {"i} a \i\ + a \\ 2})/^- 
The subscripted v quantities are easily computable from the data and we give 
explicit formulas. Naive bootstrapping produces an estimate close to (aL + 

<7 {2} + a \\ 2} ) 1^ which is grossly inadequate because it turns out that often 
vij\ 3> 1. For instance, in the Netflix data set, the largest vij\ is about 56,200. 

Resampling both rows and columns leads to a variance estimate close to 
((^{1} + (^{2} +2)c{ 2 } +3 a {i 2})/-^! w hich is mildly conservative when 

v {j} ^ an< ^ ^he cr's are of comparable magnitude. It is up to three times as 
large as it should be in the event that <r?o <C er^ 2 }- Being conservative by a 
factor of at most 3 is far more acceptable than underestimating variance by as 
much as 56,200. 

Our main contributions are as follows: 

1) We show that a naive bootstrap suitable for IID settings severely under- 
estimates the variance of the sample mean, when r — 2, while the product 
strategy mildly over-estimates it. These facts were known for resampling, 
but we show it also for reweighting. 

2) We generalize the reweighting results to r > 2 factors. In particular for 
the homoscedastic setting, the 3(7^ 2 j variance contribution from the case 

r = 2 becomes (2 r — 1)<t?j 2 r j . We find expressions for all 2 r — 1 variance 
coefficients. Under reasonable conditions, for which we note exceptions, 
this bootstrap magnifies a fc-factor variance component by roughly 2 fe — 1. 
Under simply described conditions, the k = 1 terms dominate the variance 
and then the variance magnification becomes negligible. 

3) We introduce a heteroscedastic random effects model in which every non- 
empty subset of factors contributes a random effect. The product weighted 
bootstrap remains mildly conservative even when every factorial effect for 
every observation has a distinct variance, so long as all the variances are 
uniformly bounded away from and infinity. 

An outline of the paper is as follows. Section [2] introduces our notation for 
the random effects model and some observation counts and then defines the 
random effects variance that we seek to estimate. Section [3] considers naive 
bootstrap methods that simply resample or reweight the observations as if they 
were IID. They seriously underestimate the true variance unless the only nonzero 
variance component is that of the highest order interaction. Reweighting has 
a slight advantage because it allows one to step up the sampling variance to 
compensate for cases where the naive bootstrap variance is only a modest un- 
derestimate. Section|4]introduces a factorial reweighting bootstrap strategy. For 
data with r = 2 factors, the reweighting results closely match the resampling 
results from Owen| ( |2007[ ) . This section includes an interpretable approximation 
to the exact bootstrap variance. Section [6] considers the heteroscedastic case, 
where every variance component at every combination of its factors has its own 
variance parameter. When the main effects are dominant, then the proposed 
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bootstrap closely matches the desired variance even in the heteroscedastic set- 
ting. Section [7] describes repeated observations and factors nested inside the 
ones being reweighted. Section [8] has a numerical example from Facebook. In 
that dataset, UK-based users make longer comments than do US-based users, 
when posting from mobile devices. The reverse holds for comments made at 
Facebook's standard web interface. The differences are small, but statistically 
significant, even after taking account of a three factor structure (commenter, 
sharer, and URL). The proofs appear in an Appendix. 

Although the product reweighting algorithm is simple, its analysis in the 
random effects context is very technical. Section [9] discusses some larger sta- 
tistical issues. Among these are the reasons that we do not model the possible 
informativeness of the missing data mechanism, the reasons for focusing on the 
bootstrap variance of a sample mean, and the motivation for considering the 
heteroscedastic random effects model, which contains many more parameters 
than observations. 

2 Notation and random effects model 

The random variables of interest take the form A^,^... Jr € M. d for integers ij > 
1 and j = 1, . . . , r. To simplify notation, we write Xi for i = . . . , i r ). We 
work with A of dimension d = 1. The generalization to d > 1 is straightforward. 
We have in mind applications where each value of ij corresponds to one level 
of a categorical variable with many potential values. In Internet applications, 
index values ij might represent users, URLs, IP addresses, ads, query strings 
and so on. There may be no a priori upper bound on the number of distinct 
levels for ij. 

The data are composed of N of these random variables, where 1 < N < oo. 
The binary variable Zi takes the value 1 when observation Aj is present and 
Zi = when Xi is absent. We work conditionally on Zi so that they are 
nonrandom. In practice the pattern of missingness in Zi may be important. As 
with prior work, we avoid modeling Zi in order to focus on estimating variance, 
apart from some brief remarks in Section [9j 

The letters u and v denote subsets of [r] = {l,...,r} throughout. The 
summation 1S taken over all 2 r subsets of [r], and other summations, such 
as ^2 vDu denote sums over the first named set (here v) subject to the indicated 
condition with the other set(s) (here u) held fixed. The index i u extracts the 
components ij for j G u. Then i u = i' u means that ij = i'j for all j G u. 

Our r-fold crossed random effects model is 

X i = ^+Z^ £ i.u (1) 

where (jgl and Si tU are mean random variables that depend on i only through 
i u . We have £i )U = _ u if i u = i' u and Si. u independent of u otherwise. The 
covariance of Ei tU and y is 

Cov(ei, u ,e < / j14 /) = E(£i,«£i>') = o'S 1 «=« /1 in=i'„ ( 2 ) 
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for < oo. 

To illustrate the model notation, suppose that r = 2 and one observation is 
at i = (38,44) while another is at % = (38, 19). Then 

Xi = X( 38i 44) = [i + £(38,44), {1} + £(38,44), {2} + £(38,44), {1, 2} j an d 
Xi' = X(38,19) = A* + £ (38,19),{1} + £ (38,19), {2} + £(38, 19), {1,2} • 

Because i and i' share a value for ii they have the same random effect for the 
set u = {1}. That is £ijn — £»',{].}• This is the only effect that they share 
and so Cov(Xi, JQ/) = C/n- More generally, suppose that two indices i and % 
satisfy ij — i'j for and only for j E u. Then Xi and X^ share random effects 
£i,u = £i',v for all nonempty v C u and so CovpQ,^/) = J2v.0^vCu a l- 

The expression £(38,44),{i} is mildly redundant since the second index 12 = 44 
is ignored. We could have written it as £(38), {l}- Such a choice amounts to 
writing the general case as £i u , u which is more cumbersome when it appears in 
lengthy expressions. 

The sample mean of X is the ratio 

X = Y J X i Z i /Y J Z i , (4) 

i i 

where the sums are over all index values i. The denominator in Q is the 
total number N of observations. Our goal is to estimate the variance of X by 
resampling methods. 

2.1 Partial duplicate observations 

We will need to keep track of the extent to which different observations have 
the same index values, in order to properly reflect correlations among the X^. 
For each i and u C [r], the number 

counts how many observations match Xi for all indices j £ u. If Zi — 1 then 
Xi :U > 1 because Xi matches itself. By convention Ni, = N and Ni m = 1- 
The quantity 

1 

Vu = ]v Z * N *,u - 1 

i 

is the average number of matches in the subset u for observations in the data 
set, and !A r i = 1. 

The most important of the v u are for singletons u = {j}. The value Vsj\ has 
a quadratic dependence on the pattern of duplication in the data. To see this, 
write ngj = Zi\i j= i for the number of times that variable j is equal to t in 
the data. Then = N^ 1 YleLi n ? j because each Ni ^ — n.^j appears rag- 
times in the summation defining 
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If u C v then v u > In some applications ^ u 3> ZA> for proper subsets u C 
u. For those applications, multiple matches are very unusual. In other settings 
two factors, say %\ and i% might be highly though not perfectly dependent (e.g., 
customer ID and phone number) and then ^{1.2} might be only slightly smaller 
than V{ X } or V{2}- We return to this issue in Section [5] 

The specific pair of data values i and i' match in components 

M u , = {j e [r] I i 3 = »;•}. 

For the motivating data, most of the Ma' are empty and most of the rest have 
cardinality \Ma'\ = 1. We have \Ma'\ — r if and only if i = i . Although M^i 
is defined for all pairs i and i' we only use it when Z^Z^ = 1, that is when both 
Xi and Xi' have been observed, and the term 'most' above refers to these pairs. 
For each i and k = 0, 1, . . . , r, the number 

Ni y k = Zj'l\M u ,\=k 
i' 

counts how many observations match Xi in exactly k places. 



2.2 Random effects variance of X 

Here we record the true variance of X, using the random effects model. This is 
the quantity we hope to estimate by bootstrapping. 

Theorem 1. In the random effects model ([T]) 

Var(X) = l£ v u o*. (5) 

■11^0 

The contributions of the variance components o~\ are proportional to the 
duplication indices v u . For large sparse data sets we often find that 1 <§; v u <C N 
when < \u\ < r. 

Our bootstrap approximations to this variance are centered around a quan- 
tity (1/^V) J2 U ^0 f° r § am coefficients j u that depend on the data config- 
uration and the particular bootstrap method. Ideally we want j u = v u . More 
realistically, some bootstrap methods are able to get 7^ > v u with j u just barely 
larger than v u for the singletons u = {j} which we expect to dominate Var(X). 



3 Naive bootstrap methods 



There are two main ways to bootstrap: resampling (Efron 19791 and reweight- 
ing (iRubin 1981 ), with the distinction being that the former uses a multinomial 



distribution on the data while the latter applies independent random weights 
to the observations. 

Naive bootstrap methods simply resample or reweight the ./V observations 
without regard to their factorial structure. That is they use the same bootstrap 
one might use for IID samples. Here we show that naive bootstrap resampling 
and rcweighting have very similar and very unsatisfactory performance. 
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3.1 Naive resampling 

In the naive bootstrap, all N observations are resampled with replacement. The 
naive bootstrap variance of X converges to 

Var NB (X) = J Lj2 Z i( X i- *) 2 ( 6 ) 

i 

as the number of resampled data sets tends to infinity. 

Theorem 2. Under the random effects model ([T]), the expected value of the 
naive bootstrap variance of X is 

E RE (Var NB (X)) = 1 £ a 2 u (l - ^). (7) 

When r > 1, the naive bootstrap can severely underestimate the coefficients 
of o-jjy We can see the effect in the Netflix data, which has r = 2. The gain 
coefficients v u can be computed directly. For a random variable X following 
the random effects model |l]) with variance components CT mov i OS , of aters and 

°fnovicsx raters we nave 

Var(X) = — (56,200 a^ovies + 646 cr r 2 ators + cr^ oviesXratera ), while 

VarNB(^) < ^("movies + "raters + "'movies X raters ) ) 

where N = 100,000,000. If X has a large variance components for movies, the 
underestimation can be severe. Even quantities dominated by a rater effect will 
have a naive bootstrap variance far too small. 

Theorem [2] generalizes Lemma 2 of Owen (2007) which treats naive bootstrap 



sampling for r — 2. We note that Owen (2007 p. 391) has an error: it gives the 



coefficient of 2 j as 1/N where it should be 1/(N — 1) 
3.2 Naive re weighting 



Posterior sampling under the Bayesian bootstrap (Rubin 1981) uses indepen- 
dent Exp(l) weights on the sample values. This corresponds to a posterior dis- 
tribution on X that is a Dirichlet distribution with parameter vector (1, 1, . . . , 1) 
with a 1 for each observation. The corresponding prior is a degenerate Dirichlet 
with a parameter of on all possible values for the random variable. The pos- 
terior is degenerate putting probability on any value of X that was never seen 
in the sample, thus eliminating the user's need to know which possible values 
were not in fact observed. This motivation is simplest when the observations 
are assumed to be distinct as for example with continuously distributed values, 
but the method is also used on data with ties. 

In the naive Bayesian bootstrap, all N observations are given random weights 
which arc then normalized. Observation i gets weight Wi ~ G independently 
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sampled. We assume that G has mean 1 and variance r 2 < oo. Typically 
r 2 = 1. 

The original Bayesian bootstrap (Rubin 1981) had Wi ~ Exp(l) but other 



distributions are useful too. Taking Wi ~ Poi(l) gives a result very similar 
to the usual bootstrap, and it has integer weights. Independent Bin(JV, 1/N) 
weights would provide a more exact match, but for large N there is no practical 



difference between Bin(iV, 1/N) and Poi(l). See Oza (2001) and Lee and Clyde 



( 2004 ) for uses of independent reweighting in bagging and boosting. 

Taking Wi ~ U{0, 2} = (<5 + 62) /2 also has integer values. The algorithm 
goes 'double or nothing' independently on all N observations. The nonzero 
integer values are all equal, so these weights correspond to using a random 
unweighted subset of the data. Double-or-nothing weighting is then a version of 



half-sampling methods (McCarthy 19691 without the constraint on the sum of 



weights, just as Poisson weighting removes a sum constraint from the original 
bootstrap. 

The choice of weights makes a small difference to the bootstrap performance. 
See Section [331 

Each bootstrap resampled mean takes the form 

X* = T*/N* 

where T* = Y,i WiZiXi and N* = £\ WiZi. The bootstrap mean T* /N* is a 
ratio estimator of X. The asymptotic formula for the variance is 

Vai-NBB^n = ^Enbb((T* -XN*f). 

The tilde on VarNBB is a reminder that this formula is a delta method approxi- 
mation: it is the variance of a Taylor approximationrto X* . Because N is usually 
very large in the target applications, we consider VarNBB to be a reliable proxy 
for VarNBB • 



Theorem 3. In the random effects model ([!]) 



E RE (Var N BB(X*)) = if J2 ^i}-^)- ^ 

u=L0 

The naive Bayesian bootstrap using r 2 = 1 has the same average variance as 
the naive bootstrap. In large data sets we may find that v u » r 2 (l — v u /N) and 
then the Bayesian bootstrap greatly underestimates the true variance. When 
max u ^ v u is not too large, then Theorem[3]offers a way to counter this problem. 
We can take simply multiply the naive bootstrap variance by r 2 = max„^ v u 
to get conservative variance estimates. The largest v u comes from u = {j} for 
some j £ [r] and it is an easy quantity to compute. 

3.3 Bootstrap stability 

Any distribution on weig hts with E(W) = 1 and Yax(W) = r 2 will have the 
same value for Ere (VarNBB (X* ) ) . Several different weight distributions are 
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popular in the literature. Oza (2001 1 takes Wij to be independent Poisson ran- 



dom variables with mean 1. This creates a very close approximation to the 



original bootstrap's multinomial weights. Lee and Clyde (2004) prefer expo- 
nential weights with mean 1 because they yield an exact online version of the 
Bayesian bootstrap. 

In this section we look at the effect of the weight distribution. Other things 
being equal, we prefer a bootstrap to yield a stable variance estimate. That is, 
we like a smaller variance under bootstrap sampling for the estimated variance 
of the mean. For this purpose it is better to have weights with a small kurtosis 
k = E((W — 1) 4 )/t 4 — 3. The smallest possible kurtosis for weights with mean 
1 and variance 1 arises for weights uniformly distributed on the values and 2. 
The kurtosis of the data, k x = (l/N)Y^ z i( x i ~ X) 4 /a 4 - 3 where a 2 = 
(1/-/V) J2i Zi(Xi — X) 2 , also plays a role. We work out the consequences for the 
naive bootstrap for simplicity. 

If we hold the observations Xi fixed and implement the bootstrap, doing 
some number B of replicates, we will estimate the quantity 

i i' 

where Yi — Xi — X. To estimate this variance we may use 

~ 1 B 

VarNBB^n = ZiZi'W^W^bYiYi, 



BN 2 

6=1 i %' 



(9) 



B ^— ' V N 

6=1 



where Wi^ are independent identically distributed random weights and b = 
1, . . . , B indexes the bootstrap replications. The hat in ^ represents estimation 
from B bootstrap samples. It is possible to use ([9| with B = 1. That such a 
"unistrap" is possible, stems from the use of a delta method approximation. 
Equation |9]) is not the usual estimator. The more usual variance estimate 



S NBB 



^^BTlEW-^ 2 ( 10 ) 



6=1 



where 



1 1 B 

iJ^^W,, and x: = -J2x*b- (11) 



6=1 



Theorem 4. Let W and be IID random variables with mean 1 variance 
t 2 and kurtosis k w < oo . Then holding Yi = Xi — X fixed, 

Var NBB (Va7'NBB(X*)) = ^ ( 2 + k{Kx + 3) 



BN 2 V N 
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where a 2 = (1/JV) J2i z i Y i> and K * = (W) Ei ZiY^/a 4 - 3. .4 de/to mei/iod 
approximation gives 

. 2 , . (t 4 t 4 / 2B K ( Ka; + 3)\ 
Var NBB ( SNBB ) = — ^_ + j . 

When k(k x + 3) <C N, then Var NBB (X*) with B reweightings has approxi- 
mately the variance of Snbb(^*) w hh B + 1 reweightings. 

We find here that there are only small differences between weighting schemes, 
but double-or-nothing weights having the smallest possible kurtosis k = — 2 have 
the best stability. The Poi(l) distribution has k = 1 and the Exp(l) distribution 
has k = 6. 



4 Factorial re weighting 

Our proposal here is to apply a product of independent random weights to the 
data. Observation i is given weight W\ > 0. The weights take the form 

r 

Wi = n (12) 

3=1 

where Wjj. are independent random variables for j € [r] and ij > 1. We assume 
that E(Wj,i j ) = 1 and Vai(Wj : i j ) — r? < oo. The usual choice has all rj equal 
to a common r 2 which in turn is usually equal to 1. 

For the example in equation |3]), the observation at index i — (38,44) gets 
weight Wi — Wi,38 W244. It shares one weight factor with the observation at 
i! = (38, 19) which has' W v = ^1,38^2,19- 

The reweighted mean X* is once again a ratio estimate with delta method 
approximation 

VarpwPH - ^E PW ((r - XN*) 2 ) (13) 



where T* = ]T\ and N* = J2i Z i w i for w i S iven b Y @- The sub- 

script PW refers to random weights taking the product form. 

The bootstrap variance depends on precise details of the overlaps among 
different observations. We will derive some approximations to this variance 
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below. For the exact variance we need to introduce some additional quantities: 



Pk 



J2^2^2 Z i Z i' 1 \^hi'\=k 



N 2 



Vk,u = -Tj ^ E ZiZ i'' i -\^hi'\=k' i -i u =i' u , an d, 

i i' 

Vk,u = E E E ZiZi'Zi"l\M ii ,\=kli u =iZ' 
i i' i" 

i 

for k = 0, 1, . . . , r and u C [r]. In words, gives the fraction of data pairs that 
match in exactly k positions, while Vk,u/N gives the fraction of data pairs that 
match in exactly k positions including all j £ u. The third quantity, Vk.m is N 
times the fraction of data triples ,i ) in which i matches % in precisely k 
places while also matching i for all j G it. 
These new quantities satisfy the identities 

r r r 

yjpfc = 1, and f fei „ = = f tt . 

fc=0 fc=0 fc=0 

Also it is clear that v^,u = when \u\ > k. 
Theorem 5. In the random effects model ([lj 

E RE (VaJ PW (X*)) = ^ E ( 14 ) 

where 

r 

7« = E^ 1 + r2 ) k ( v k : u - 2? fe: „ + PkV u )- (15) 

A:=0 

The quantities 7„ are 'gain coefficients' which multiply a 2 /N . Ideally they 
should equal v u and then the bootstrap variance would match the desired one. 
Where they differ from v u , the bootstrap variance is biased. Typically the bias is 
positive, making this bootstrap conservative. Sometimes the bias is very small. 

The special case r = 1 is interesting because it corresponds to IID sam- 
pling. Then the only variance component is cr?^ which we abbreviate to a 2 and 
equation ( 14 1 simplifies to 

1 

1 



N \ N) N -1 

In this instance there is a (trivial) negative bias if r 2 = 1 . 

Independently reweighting rows and columns is similar to independently 
resampling them. That strategy of bootstrapping rows and columns has been 
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given several names in the literature. Brennan et al. (19871 called it "boot- 
p,i" because for educational testing data, it resamples both people and items. 



McCullagh| ( |2000| page 294) calls the method "Boot-IP . There is also another 
"Boot-II" for the one way layout in that paper. Noting a similarity to Cornfield 
and Tukey's pigeonhole model for analysis of variance, Owen (2007) calls this 
approach the "pigeonhole bootstrap" . Reweighting with a product of Rubin's 
(1981) exponential weights is thus a "Bayesian pigeonhole bootstrap". 



5 Interpretable approximations 

Theorem [5] gives exact finite sample formulas for the gain coefficients j u , but 
they are unwieldy. Here we make some approximations to j u in order to get 
more interpretable results. 

First we introduce the quantity 

N i.u N i,W 

e — max max — — = max max — — — , 

i u^0 N i l<j<r N 

which measures the largest proportional duplication of indices. Though 1 > 
e > 1/N, we anticipate that e will usually be small. For the Netflix data, 
e = 232,944/100,480,507 = 0.00232 stemming from one movie having 232,944 
ratings. 

Although we suppose that e is small below, it is worth pointing out that 
exceptions do arise, even for some very large data sets. For example if the ob- 
served data form a complete Ni x N2 x • • • x N r sample, then e = maxi< j< r 1 /Nj . 
If one factor takes only a modest number of levels, then e is large. A second 
context where e is large arises when one of the factors is greatly dominated by 
one of its levels, as for example, we might find in Internet data where one factor 
is the country of the web user. 

A second parameter to aid interpretability is 

v v 

i] = max — . 

0CuCv V u 

By construction r\ < 1, and we ordinarily expect rj to be small. Of the indices 
which match for j £ u only a relatively small number should also match for 
j £ v — u too, because each additional match in large data sets represents a 
coincidence. For the Netflix data 

77 = max{i/{ 1)2 }/i/{ 1 },i/{ 1)2 }/i/{ 2 }} = 1/646 = 0.00155. 

While 77 is often small, there are exceptions. If two factors are very dependent 
then r\ need not be small. For example people's names and phone numbers 
may be such variables: many or even most phone numbers are used by a small 
number of people (often one) and many people use only a small number of phone 
numbers. Then the fraction of data pairs matching on both of these variables 
will not be much smaller than the fraction matching on one of them. 
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In simplifying expressions we use 0(r]) and O(e). These describe limits as 
77 (respectively e) converge to 0. The implied constants may depend on r. In 
some expressions we have retained explicit constants. 

Theorem 6. In the random effects model the gain coefficient (151 for 
u =/= in the product reweighted bootstrap is 

lu = „„[(1 + r 2 )H - 1 + 9 u e] + £(1 + r 2 )H( r 2)l—l^ (16) 
w/iere |0„| < (1 + t 2 )((1 + r 2 )' r - l)/r 2 . for t 2 = 1, 

w/iere |0J < 2 r+1 - 2. 

For r = 2 using ^{1,2} = 1 and the usual choice r 2 = 1, we find that 

= + e {J} 6 ) + 2 < ■? = 1 ' 2 ' and 

7{i,2} = ^{1,2} (3 + 0{i,2}e) 

where each |0| < 6. The Bayesian pigeonhole bootstrap variance closely matches 
the ordinary pigeonhole bootstrap variance. In the extreme setting where tr?-^ = 

a {2} — < a \\ 2} the resulting bootstrap variance is about three times as high 
as it should be. In a limit as min, vij\ — > 00 and e — > 0, 

ER E (Var P w(X*)) 

Var(X) 1 J 

holds for fixed er 2 ^ > 0, j = 1, 2. For r = 3, with ^{1.2,3} = 1 and t 2 = 1 

7{1} ~ ^{1} + 4^(1.2} + 4v{i,a} + 8 

7{i,2} ~ ^{1,2} + 8, and 

7(1,2,3} ~ 7, 

where ss reflects an additive error of size v u 9 u e for \9 U \ < 14. In the extreme case 
where the only nonzero variance coefficient is a 2 ^ then the product reweighted 
bootstrap variance is about 7 times as large as it should be. On the other hand, 
when the main effect variances a 2 ^ x are positive and v v jv u — > for v C u, 
then (171 holds. More generally, we have Theorem [7j 



Theorem 7. For the random effects model ([I]) and the product reweighted boot- 
strap with t 2 — 1, the gain coefficient for nonempty u C [r] satisfies 

2 M _ 1 _ (2 r+1 - 2)e < — < 2l u l(l + 2n)\ v - u \ - 1 + (2 r+1 - 2)e. 
// £/iere e:m£ m and M u/ii/i 0<m<<r^<Af<oo for all w ^0, i/ien 

^^». 1+0( , +e , 
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The first claim of Theorem [7] can be summarized as 

^ = (2'*! - 1)(1 + 0(7])) + 0(e) « 2^ - 1, 

and the second as E RE ( Var PW (X*)) /Var(X) s» 1, where the implied constants 
on r. They generally grow exponentially in r but the interesting values of r are 
small integers from 2 to 6 or so. The main effects dominate when r\ is small and 
they are properly accounted for when e is small. 



6 The heteroscedastic model 

In the r-fold crossed random effects model 0, the term Si <u has the same 
variance for all i. This model may not be realistic. For instance, the Netflix data 
includes some customers whose ratings have very small variance and others with 
a very large variance. Similarly, but to a lesser extent, movies also differ in the 
variance of their ratings. Unequal variances have the potential to bias inferences, 
especially in unbalanced cases, because the entities with more observations on 
them might have systematically higher variance than the others do. 

A more realistic model is the heteroscedastic r-fold crossed random 
effects model, with 

X i = A* + X] £i >" ( 18 ) 

where \i G R and ei iU are independent random variables with mean and 
variance cr? . There are more variance parameters than observations, we do 



not need to estimate them. Owen (20071 gives conditions under which the 



pigeonhole bootstrap with r — 2 produces a variance estimate with relative 
error tending to zero in the heteroscedastic setting. Here we investigate product 



reweighting with general r for model (18). 



We need some new quantities. For u ^ 0, define 



N- 



i' 

Vi,k,u — ^ Zi'\\M 4i ,\=k^-i u =i'„ ■ 



We also will use 



2 



i 



and 
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Next, we parallel the development from the ordinary random effects model (JlJ 
Theorem M gives the exact variance of X for heteroscedastic random effects, 
Theorem [9] gives the gain coefficients under product reweighting, Theorem 10 
provides interpretable bounds for the gains in terms of e. Finally Theorem 11 



gives conditions under which the product reweighted bootstrap has a negligible 
bias. 



Theorem 8. In the heteroscedastic random effect model ( 18 ) 



Var W = iEE^4- ( 19 ) 



N 

u^0 i 



Theorem 9. In the heteroscedastic random effects model (18) 

E RE (Var PW (**)) = ~ E I>.«°i" ( 2 °) 

where 

r 

li,u = ^(1 + r 2 ) k (v iik . u - 2i/ i<k v i<u +Pfci/i,«). (21) 

fe=0 



Theorem 10. In the heteroscedastic random effects model (18), the gain coef- 
ficient 7i, M of (21) for Zi = 1 and u =/= 0, in the product reweighted bootstrap 
is 

li,u = ViA(l + T 2 ) H - 1 + M + + t2 ) H (T 2 ) lV ~ Ul Vi,v 

v~Du 

where \9 U \ < (1 + r 2 )((l + r 2 ) r - l)/r 2 . for t 2 = 1 

li,u = ^,«[2 |u| - 1 + 9 u e] + J2 2M »i,v 

vDu 



where I6U < 2 r+1 - 2. 



Theorem 10 establishes that our bootstrap is conservative in the heteroscedas- 
tic case. With r 2 = 1 we have 

liiH > 2 M - l - (2 r+1 - 2)e. 

For the homoscedastic random effects model, the main effects dominate when 
r\ = max0c tt ct) v v jv u is small and the variance components are all within the 
interval [m,M] for < m < M > oo. In the heteroscedastic case we might 
reasonably require every o~\ u € [to, -M]. The analysis we used for Theorem jTj 
also requires the quantities 

max ctic„ — — Zi = 1 
Zi = 
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to be small. 

For r — 2 the only subsets u and v which appear in r)i are u = {j} and 
v = {1,2}. Furthermore ^,{1,2} = 1/^ and so 

max 77 j = max max — = e. 

i ie{i,2} i N 

Then using the same argument we used to prove the second part of Theorem [7] 
we get 

ERE (^p W (^)) 

Var(X) v ; ' 

The case for r > 2 is more complicated. There may be observations i with 
large values for Vi^/vi.u where C u C u. We still get a good approximation 
from the product reweighted bootstrap because even though the individual r\i 
need not always be small, sums of Ui tV over i are small compared to correspond- 
ing sums of Vi jU for C u C u. 



Theorem 11. Jbr </ie heteroscedastic random effects model (181, assume that 
there exist m and M with 0<m<af u <M<oo. Then the product reweighted 
bootstrap with t 2 — 1, satisfies 



7 Nested random effects 

The r-fold crossed random effects model |T]) excludes replicated observations 
by definition: there can be only one Xi for any combination i of factors. If 
two X's are observed to share all index values ij, we can incorporate them by 
introducing an r 4- l'st index which breaks the ties. Conditionally on the 
effects of the first r indices, distinct replicates are independent. That is a\ = 
when r + 1 G u but m ^ {l,2,...,r + l}. The replicate index i r+ i is a factor 
that is nested within the first r factors. 

More generally, we could have s additional indices corresponding to factors 
crossed with each other, but nested within our r outer factors. Then the index 
i € {1, 2, . . . }' r+s uniquely identifies a data point. Ordinary replication has 
s = l. The nesting structure means that 

cr^ = 0, if un{r+l,...,r + «}^0 and !in[r]/[r], (22) 

In words, the effect ei tU is if the factors in u include any of the inner factors 
without including all of the outer factors. 

When one factor is nested within another, such as replicates within subjects, 
it is a common practice to resample or reweight the outer factor only. For 
example, the resampled data set might contain resampled subjects retaining 
the repeated measurements from each of them. 
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In the nested setting, the variance of X under an r + s factor version of the 
random effects model ([I]) is still (1/N) ^2 U ^ v u o\ although many of the a\ 
terms are zero. 

For i <= [r + s] let \i\ — . . . , i r ) be the indices of its outer factors. We 
can study these nested models by introducing the variables 

Tyi\ = Zjil j = |ij Xi> , and 

i' 

M Yi\ = H Zi,1 L*'J = LiJ' 

i' 

so that the sample mean 

is an r-factor ratio estimator. 

When the numbers My^ of replicates for each outer factor vary, we obtain 
a heteroscedastic random effects model in the first r variables. 



8 Example: loquacity of Facebook comments 

We present an analysis of national differences in comment length on Facebook. 
In particular, Facebook users can share links with their friends. Their friends, 
and the posting user, can comment on the link. We compare the length of these 
comments produced by users in the United States using the site in American 
English (US users) and those produced by users in the United Kingdom using 
the site in British English (UK users). We restrict the analysis to US and 
UK users commenting on links shared by US and UK users. We additionally 
consider two different modes by which users can comment: the standard web 
interface to Facebook (web) and an application for some touchscreen mobile 
phones (mobile). 

We treat the logarithm of the number of characters in a comment as the 
outcome in the following random effects model: 

X CT a.i Mem ~t~ ^ ^ £cm,i,u 

where fj, cm is the mean log characters for country c in mode m. Here the mem- 
bers of i are indexes for the user sharing the link (sharer), the user commenting 
on the link (commenter) , and the canonicalized URL being shared (URL). By 
definition, no comments have characters, and so each X in our data set is well 
defined. 

The data consist of X cni: i for a sample of comments by US and UK who are 
using Facebook in American and British English, respectively, during a short 
period in 2011. This sample includes 18,134,419 comments by 8,078,531 com- 
menters on 2,085,639 URLs shared by 3,904,715 sharers. We examine whether 
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these US and UK users post comments of different lengths for both of the modes. 
The duplication coefficients for this data are 



u sh = 17.71, Vam = 7.71, !/„! = 26,854.92 
^sh.com = 5.92, z/ sh ,uri = 12.91, ^ C om,uri 5.19, and 



The coefficient for URLs is conspicuously large, indicating that a naive bootstrap 
would be very unreliable. 

The sample mean for a country and mode is 

^cm — y ■ 

/ j j ^cm,i 

We regard (X cm as an estimate of fi cm conditional on the observed combinations 
of sharers, commenters, and URLs. 

The four sample means for both countries and both modes suggest that 
the US users write longer comments than UK users when commenting on the 
web (/tus, wob = 3.62, /tuK, web = 3.55), while UK users write longer comments 
than US users when commenting via the selected mobile interface (/ius, mobile = 
3.5, Auk, mobile = 3.57). Many differences between US and UK users likely 
contribute to this observed differences. Before searching for causes of these two 
differences, a data analyst would likely want to quantify the evidence for the 
existence and size of these differences. We test whether these two pairs of means 
are likely to be observed given the null hypothesis of no difference in comment 
length between the countries within each mode. 



Using software for Hive (Thusoo et al. 20091, a Hadoop-based map-reduce 



data warehousing and parallel computing environment, we can compute each 
of these four means for a number of bootstrap reweightings of the data, while 
visiting each observation only once. When visiting an observation, the hashed 
identifiers for the factor levels for that observation are each used as seeds to 
random number generators. This allows all nodes to use the same U{0, 2} draw 
in computing the product weight for all observations that share a particular 
factor level. Note that users can be both sharers and commenters. Since users 
can comment on their own shared links, some observations could have the same 
factor level identifier for both the sharer and commenter levels. We use different 
portions of the hashed identifier so that the weights for these two roles are not 
dependent. For each reweighting, we compute four reweighted sample means 

~* ^-/i ^ cm cm cm 

/ . j ^cm^i yv cm,t 

corresponding to c £ {US, UK} and m € {web, mobile}. 

For comparison, we conduct this analysis reweighting one, two, and all three 
of the factors. Figure [I] presents R = 50 bootstrapped differences in the two 
pairs of means when reweighting commenters, commenters and sharers, and all 
three factors. Inspection of these ECDFs confirms that the observed differences 
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Figure 1: Difference between the logged number of characters in comments 
by US and UK users for three different bootstrap rewcightings with R = 50. 
Each data point in the plotted ECDF is the difference in means from a single 
bootstrap reweighting. US users post longer comments than UK users on the 
web, but this difference is reversed for the mobile interface studied. 
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Figure 2: Confidence intervals for the difference between the logged number 
of characters in comments by US and UK users for three different bootstrap 
reweightings with R = 50. Confidence intervals span the 2.5% and 97.5% quan- 
tiles of the normal with variance computed from the bootstrap reweightings. 
While all three analyses reject the null hypothesis, the one- and two-factor anal- 
yses may substantially overstate confidence about the size of the true difference, 
especially in the case of comments posted via the web interface. 
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cannot be attributed to chance, even when accounting for the random main 
and interaction effects of commenters, sharers, and URLs. The bootstrapped 
differences in means are strikingly more dispersed for the three-factor analysis. 
Figure [2] shows 95% confidence intervals for the two differences computed as 
quantiles of the normal distribution with variance computed from the bootstrap 
reweightings. This highlights the substantial overstatement of certainty that can 
come from neglecting the presence of additional random effects. In this case, 
the three analyses would all reject the null hypothesis, but would produce quite 
different confidence intervals. 

For the approximations developed in Section [5] to apply, we require that e 
and r\ be small - that no single level of any random effect make up a large 
portion of the observations and that the number of observations matching on v 
is small compared to the number matching on u factors for all C u C v. We 
find that e = 686,990/18,134,419 = 0.0379, as one URL had 686,990 comments 
in this sample. We also found that -q = 0.767. Because rj is not very small it is 
possible that the variance estimates are conservative. 

9 Discussion 

We have worked conditionally on the observed values holding Zi fixed. It is 
clear that missingness can be informative and thereby introduce a bias into a 
sample mean. 

The way to correct for missingness and even whether to do such a correction 
is problem dependent. In the Netflix data, the company is seeking to predict 
ratings that were not made and so the bias between observed and unobserved 
ratings is of interest. The people who competed in the Netflix contest were 
trying to predict ratings that were actually made and then artificially withheld, 
so the pairs to be predicted were not subject to this bias. For the Facebook 
data, some of the observed difference between the lengths of comments by US 
and UK users may be due to differences in which URLs they comment on. An 
accounting of missingness might involve inferring the likely length of comments 
that would have been made by US and UK users if they had the same propensity 
to comment on particular URLs. An analysis made conditionally on Zi describes 
the statistical stability of comment lengths for the actual pattern of commenting, 
which may be of more interest. 

To make an adjustment for missing data requires some kind of assumption 
about the missingness mechanism. That assumption cannot be tested within a 
given data set because the necessary confirmation values are not available. It 
is clear that reweighting cannot correct a sampling bias because many different 
sample biases may be consistent with an observed data set. In a given problem 
with our preferred adjustment for missingness built in to the statistic of interest, 
we could then consider how to bootstrap the resulting bias adjusted statistics. 
Alternatively, if the statistic is partially identified, then we could consider how 
to bootstrap the resulting sample bounds on the statistic. It is not obvious 
which bootstrap method would suit these tasks, but it seems clear that in the 
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random effects context product reweighting will succeed more generally than 
naive bootstrapping. 

We have used the variance of a sample mean as a way to identify a suitable 
bootstrap method. Plain sample means are practically important. For example, 
click through rates, or feature usage rates are means or ratios of means. Even 
for this simple problem, naive bootstrapping methods are severely downwardly 
biased in the random effects setting. Product weighting replaces this bias by a 
small upward bias that is more acceptable in applications. 

A bootstrap method that under- estimates the variance of a mean cannot be 
expected to work well on other problems. One that is properly calibrated or 
conservative for the variance of a scalar sample mean will also work in some 
other settings. 

The extension to multivariate means is very straightforward. When Xi <G M. d 
for d > 1 we may replace the variances o\ or erf u by variance-covariance matrices 
E u or E j „ respectively in the variance formulas. This follows by considering 
the variance of (b T Xi for vectors 6 G Mr. 



Bootstrap correctness extends from means to other statistics. See Hall ( 1992 1 



and Mammen (1992). The extension to smooth functions g(X) of means is via 
Taylor expansion, when g has a Jacobian matrix with full rank at E(X). 

The bootstrap is usually used to get confidence intervals, not variance es- 
timates. For an asymptotically unbiased statistic that satisfies a central limit 
theorem, a properly calibrated variance yields asymptotically correct bootstrap 
percentile confidence intervals. An overestimated variance yields conservative 
percentile intervals. 

Another way to extend from means to other statistics is via estimating equa- 
tions. If the parameter 9 is defined by J^i Zim(Xi\ 9) = then we may test the 
hypothesis that 9 = 9$ by testing whether m,(Xi\ Oq) has mean zero. In practice 
one would ordinarily form a histogram of resampled 9* values and construct a 
confidence interval from them. 



The heteroscedastic random effects model ( 18 ) has 2 r — 1 variance parameters 
for each observation. Such a model can arise in an t — fold generalization of factor 
analysis. Suppose that Fi u is a nonrandom factor depending on indices in the 
set u C {1, 2, . . . , r} and that Li v is a mean zero random loading depending on 
indices in the set v C {1,2, ... ,r} where uDv = 0. Let 

Xi = fi-\ h F iu L iv H h £i : {i : ..., r } 

where the ellipses hide other factors of the type just described for different sub- 
sets of the variables. The term shown contributes F? Var(Lj u ) to the variance 
component for subset v on observation i. Even if the loadings have constant vari- 
ance, unequal factor values will make this variance component heteroscedastic. 
The factors and loadings could both have a product form so that they contribute 

j£u j£v 

to Xi generalizing the SVD, but a product form is not necessary. 
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A generalized factor model would be extremely hard to estimate. However, 
the total variance from all those different variance contributions is handled by 
product reweighting, with a small upward bias in the bootstrap variance of a 
mean. A similar phenomenon is well known in the context of the wild bootstrap 
(Mammen 19931 for the linear model. There a different distribution is posited 
for each of n observations in a regression and the bootstrap process provides 
reliable inferences for the regression coefficients without having to accurately 
estimate all n distributions. 
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Appendix: proofs 

This appendix contains theorem proofs and a few lemmas. The theorems are 
restated to make it easier to follow the steps. Equation numbers that appear in 
the theorem statements from the article are preserved in this appendix. 

Proof of Theorem Q] 

Theorem [l| In the random effects model ([!]) 

Var(A) = 1 J2 

Proof. The numerator of X in @ is Zi Xi = Nu + £\ Z i e < • There - 

fore the variance of X under the random effects model is 

Var(A) = ^(EE^*' E E e i,« e *>') 

% i' u^0 u'^0 

= E CT " E E Z i Z i' 1 i^=K l 

u^0 i i' 

u^0 i 

= A? E v "°$- D 

u^0 
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Proof of Theorems [2| [3], and |4[ 

Here we prove the theorems about naive bootstrap sampling. Theorem [2] is 
about naive resampling and Theorem [3] handles naive reweighting. Theorem [4] 
is about bootstrap stability. 

Theorem [2j Under the random effects model |l]) , the expected value of the 
naive bootstrap variance of X is 

E RE (Var NB (X)) = 1 £ al(l - ^). 

Proof. A {/ -statistic decomposition of the sample variance is 
Var NB (X) = ^EE - X i'f 

i i' 

2 



^EE Zi M E £i - u 



2N 3 

i i' u^0 



Si 



Under the random effects model 



E RE (Var NB (X)) = ^ E E E 2a ^ 1 ~ ^0 

i i' u^0 

To prove Theorem [3j we begin with a lemma on the covariance of pairs of 
observations under the random effects model. 

Lemma 1. Let Xi follow the random effects model and let Yi = Xi — X . 
Then 

E RE (X;AV) =M 2 + E a ^=< ( 23 ) 

u=t0 

and 

EJaKdWO = E r - N ^- N ^ + %)- (24) 

u^0 



Proof. Equation (23) follows directly from the random effects model definition. 



Expanding YiY^ yields 

XiXi' — — Zi"XiXi" — — Zi"Xi'Xi" + — — Zi"Zi"'Xi"Xi"' . 
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Because fj, cancels from Yi we may assume that /i — while proving (24). Now 
Therefore 

u^0 i" 



which reduces to ( 24 ) . □ 
Theorem [3j In the random effects model 

Ere(V^nbb(^)) = jj £ ^(l - § 

Proo/. Let Y~ 2 = X, - X and T y * = J2i WiZiYi. Then 

Ere(V^nbb(X*)) = ^Le re (E NB b((T* -XN*) 2 )) 

i i' 

= H ^i^i'ERE 0% )E N BB (Wi Wi> ) . 

i %' 

Next, Enbb (WiWi') = 1 + r 2 l i=i /. Therefore 
E RE (Var NBB pT)) = ~ £ £ Z i ^E RE (T i Y i + ~ £ ^E RE (lf ). (25) 

i i' i 

The double sum in (25) vanishes because WiZiYi = 0. Then from Lemma [lj 
the coefficient of a 2 in J25l) is 



2lyZ i (l-^^ + ^)=^(N-2v u + u u ) 
N 2 ^ \ N N J N 2y > 

establishing §8$. □ 

Theorem |4j Let W and Wi.b be IID random variables with mean 1 variance 
t 2 and kurtosis k w < oo. Then holding Yi = Xi — X fixed, 

Var NBB (\^r NBB (X*)) = |^(2 1 /i<,V ' ' ;i 



BJV 2 V N 

where a 2 = (1/N) J2i ZiY?, and k x = (1/N) J2i z i Y il^ - 3. A delta method 
approximation gives 

ct 4 t 4 / 2B k(k x + 3Y 



Var NB B(s NBB ) - BN 2\ B _ l < N 
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Proof. First, the variance of VarNBB (A'*) scales as 1/B so we can work with 
B = 1 and divide the result by B. For B = 1, we drop the subscript 6 from 
Ws. We will use the identity J2i ZiWiYi = £\ - 1)^. If S = 1, then 

Var NB B(Var NB B (X*)) equals 



E NBB ((J2 z i w i Y i 



2 2 \ 2 

a t N 



N 



i i i' 

3 X > -m / /, TT «\9.\0.-wrA f (7 T ^ 



_ r 4 g 4 (^ + 3)(/v a; + 3) 3T 4 cr 4 3r 4 cr 4 (^ + 3) crM 

~ a^ + ~7p at 3 Jp~ 
, «(*x + 3y 

For the second part 



VarNBB ( s nbb ) = Enbb(snbb) 



B — 1 B 



where k* is the kurtosis of X* = J2i z i w i Y i/ J2i z i w i- The delta method 
approximation to Enbb(snbb) ^ s T 2 a 2 /N. For the kurtosis, we make the Taylor 
approximation 

X* = X + J2 z i(Wi-l)Yi. 

i 

The expected value of X* — X reuses much of the above computation and yields 

T 4 C7 4 / k(k x + 3)' 



E NB b((X*-X) 4 )^ y2 v 

Therefore n* = k(k x + 3)/N and so 

, 2 . r 4 cr 4 / 2£ /t(/t x + 3)\ 
Var NBB Snrr = n h • □ 

Proof of Theorems [5], [6], and [7[ 

Theorem [5] gives an exact expression for the gain coefficients of the Bayesian 
pigeonhole bootstrap in the constant variance crossed random effects model. 
Theorem [6] gives an interpretable approximation to those gain coefficients. The- 
orem [7] shows factorial reweighting gives nearly the correct variance when e and 
77 are both small. 



2G 



Theorem [5} In the random effects model ([!]) 

E RE (Var PW pr)) ^ ^ 



N 



where 

r 

7« = E^ 1 + r2 ) k i v k,u - 2v ktU + p k v u ). ([151 



fc=0 



Proof. We begin along the same lines as Theorem [3] and find that 

E RE (Var P w(^*)) = ^J2J2 ZiZi ' E ™( Y ^ E ™( WiWi ')- 



N 2 

i i' 



For the product weights used in this bootstrap, 

E FW (WiWi>)= [] (l+r 2 ) = (l + r 2 )l M -'l, 

with Epw(W / iW / j') = 1 if i and i' are not equal in any components. 
From Lemma 111 the coefficient of c 2 in ERE(Varpw(^*)) is 

1 £ £ (i^ - %l - %i + £) (i + T 2 )l Mii 'l 



jV2^zl^ *V TV TV N 

i %' 



1 r 

£U + E E W 1=*^*' f 1 *.^ 



1 r 

fc=0 

Next we establish an interpretable approximation to the Bayesian pigeonhole 
bootstrap variance, using the quantity e = maxjmaxj Ni^^/N which is small 
unless the data are extremely imbalanced. 

Theorem [6j In the random effects model ([!]), the gain coefficient (151 for 
u =/= in the product reweighted bootstrap is 

lu = u„[(l + r 2 )H - 1 + 9 u e] + EC 1 + T 2 ) H (r 2 )l*-"k gjjl 

where |0 U | < (1 + r 2 )((l + t 2 )'' - 1)/t 2 . For t 2 = 1, 

7u = ! / u [2l"l-l + M + E 2H ^' 
where |0„| < 2 r+1 - 2. 
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Proof. The second claim follows immediately from the first which we now prove. 
We will approximate j u = J2l=o(l + T2 ) k ( v k,u - 1v k . u + p k v u ). First 



E(l + t 2 )V« = ± E(l + r 2 ) k £ E WHfcl*.^ 

fc— i 

^E^+^'EE^ 1 ^- 

w~Du i i' 

= ^ E a + - 2 ) H E E E ^ 

= ^(l + T 2)H^ ( _i)k-H^. 



TV 

fe=0 fc=0 



Writing w € [u, u] for u C to C v, 

E( 1 + tJ ) h E(- 1 ) m,/ « 



= E^ E (l + r 2 )H(-l)l«-H 

lOu w£[u,v] 



=E^E( i vy^ i+ ^ H ~ 

= 5> t) (l+T 2 )M(T 2 )l»-"l. 



For the other parts of 7„, we use quantities 9 that satisfy bounds < 6 < 1. 
There are several such quantities, distinguished by subscripts, and defined at 
their first appearance. First, we have the bounds 

^°=l-r^ e, and ^=0^6, 1 < k < r. (26) 
Next, for ii^0, 

Vq, u = E Z i N i^ N ifl = ]y E Z * N i,ui l ~ r °i,0t) = v u {\ - r6 Q . u e) 

i i 

and for fc = 1, . . . , r 

^fc,u = E z iNi iU Ni,k = — E ZiNi, u 6i,ke = Vu6k, u e, 

i i 

Turning to pfe, 

Po = ^E ^i-^i.o = ^ E ~~ €r6i v) = 1 - er6l o, and 

i i 

Pk = 772 E ^i-^.fc = ^ E Z iOiM = 8kt, k = l,...,r. 
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Now -2u ,u + PoV u = -v u + ^ u (25o,« - #o)^e and 

r r 

E(l + T 2 )\-2V k . u + Pk v u ) = ^ u + T 2 ) fe (0 fe - 20 fe> „)e 



fe=i fe=i 

Therefore 



7' 



■ u = i/ u ((l + r 2 )l"l - 1 + e u e) + 2 ".(1 + r 2 )(r 2 )l w -"l, where 



On 



9u = J2( 1+T ^ k ^- 2e ^ 



k=l 

The proof follows because -1 < 6 k - 29 k u < 1 and ££ =1 (1 + r 2 )* = (1 + 
t 2 )((1 + t 2 )'--1)/t 2 . ^" ' □ 

Theorem [7j For the random effects model ([I]) and the product reweighted 
bootstrap with t = 1, the gain coefficient for nonempty u C [r] satisfies 

2 \u\ _ 1 _ ( 2 r+i _ 2 j £ < 7u < 2 |tt|^ + 2}? )l«-«l - l + (2 r+1 - 2)e. 
If til ere exist m and M with < m < g\ < M < oo for all u ^ 0, then 

Proof. From Theorem [6] 

^ < -1 + V 2^ v - u \ + (2 r+1 - 2)e 

= 2l"l(l + 2r))^-^ - 1 + (2 r+1 - 2)e, 

and then using v v > 0, 

^ > 2l"l - 1 - (2 r+1 - 2)e. 

For the second claim, small r\ means that the variance is dominated by 
contributions er 2 ^ for which jyy w v {j}- Now 

E T'"' 7 " = 2 «u*2[l + 0(v + e)] 
|«|=1 |u|=l 

where the constant in O(-) can depend on r, and 

E 7u°« = E ^«[ 2N + °(»? + e )i = °iv) E 

|u|>i M>i |«|=i 



2!) 



Similarly £ N>1 J u ^l = 0(77) J2\u\=i v v?i- Therefore 



E 



RE (Varp W (X*)) (l + 0(r/ + 6))^ |ttN1 ^ 
Var(X) (l + OW)E H =i^ 



□ 



Proof of Theorems [8] through 

Here we prove the theorems for the heteroscedastic case. We begin with a 
lemma. 



Lemma 2. Let Xi follow the heteroscedastic random effects model (18) and let 
Y i= Xi- X. Then 



and 



E 



RE 



(27) 



(28) 



Proof. Equation (27) follows directly just as the analogous expression did in 
Lemma [T] Once again, expanding Y^Y^ yields 

XiXi' — — Zi"XiXi" — — Zi"Xi'Xi" + — — Zi"Zi"'Xi"Xi"' . 



and we may assume that /i — while proving ( 28 ) . Now 

Ere (jj ^ Zj'XjXj^ = — ^ Z^ z i' 1 iu=i' u (J i,u = Y 53 °i,u Vi w 

%' U^0 %' U^0 i 

72 H Z^Z^ Zi " Zi "' ll '^ ,(j2 i"-u 

U^0 i" i'" 

\ /2/2 Zi " a i"^i",u 

u^0 i" 

u^0 



and 



RE 



N 2 



N 



which together establish (28). 



□ 



Theorem |8| In the heteroscedastic random effect model ( 18 ) 



(29) 



itj£0 i 
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Proof. The proof is very similar to that of Theorem [T] 
Theorem [9j In the heteroscedastic random effects model ( 18 1 
E RE (Varpw(^*)) = ^ £ £^« CT i,«> 



□ 



(201 



u^0 i 



where 



li 



'i,u = + T 2 ) k (lSi, k . u - IVi^Vi^ + V k Vi^ u ). 



(211 



k=0 



Proof. We begin along the same lines as Theorem [3] and find that 

ER E (VarpwpT)) = ^Y,J2 ZiZ i' E ^ Y i Y i')^(W i W i :). 



As in TheoremjHJ E PW (W i W i = (1 + r 2 )l M «'l. 
From Lemma [2] 



E 



RE (Var PW (A > *)) = ^£EE Z&v{\ + r 



2\|M i4 ,| 



u^0 i i' 



2 _ 2 



(30) 



The contribution from the last term in the parentheses of ( p0| ) is 

1 r r 

x £ ^E(! + r2 ) fc E ^ = E + ^ 2 ) fe ^- 

fc = 2 fe = 

Therefore the coefficient of <r? , in EriE(Varp\v(^*)) (when = 1) is 



^E^'E Wi=*a + - 2 ) fe (i^ - *0 + ^ E(i + - 2 ) 



fe=0 



fe=0 



1 r 



□ 



fc=0 



Theorem 10, In the heteroscedastic random effects model (18), the gain co- 



efficient 7i iU of (21 1 for Zi = 1 and u ^ 0, in the product reweighted bootstrap 
is 



7i,« = + r 2 ) 1 " 1 - 1 + M] + E(l + r 2 )M (r 2 )l"-"l^ 

where |0„| < (1 + r 2 )((l + r 2 )'' - l)/r 2 . For r 2 = 1 

7i,« = M 2 '"' - 1 + M + E ^"'"v" 

where |0„| < 2 r+1 - 2. 
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Proof. From Theorem|9| j ijU = Efc=o(l + T 2 ) k {i>i,k,u - 2vi,kVi, u + v k Vi^ u ). The 
proof is similar to that of Theorem [6j so we summarize the steps. First 



£(1 + r 2 )^ i)M = Ml + t 2 )H(t 2 )I— I 



fc=0 vD-b 



Next, Vi Q = 1 — rOifie and vq = 1 — r6*o, while for fc > 1, ^ ^ = Oi^e and 
ITfc = 9k€. Here, all of the 9's are in the interval [0, 1]. The result follows as in 
Theorem U □ 



Theorem For the heteroscedastic random effects model (18), assume 

that there exist m and M with < m < u\ < M < oo. Then the product 
reweighted bootstrap with r 2 = 1, satisfies 

Proof. First we show that main effects dominate. For |u| > 1, 

5>*,«o?, u < M5><,»(2 |u| - 1 + 2 r + 1 e ) + £ 2^ 

i, 



M ( i/ u (2' u ' - 1 + 2 r+1 e) + 2 '* 



= (2M-l)Afi/ u (l + 0(e + »j)) 
= O(r)) max iAr,\, 

l<?'<r XJS 

and similarly £7 v itU o\ u = 0{rj) maxi<.,< r Uyy. For u = {j}, 

i i 

= mi/ {i} (l + 0(e)). 

Therefore 

'^-(l + Ofa + e)). 



E RE (Varp W (X*)) _ Ei 7i,fa}< {j} 
Var(X) _ £ . Ei=i "i.) ,!^ , ,: 



Next we show that the main effects are properly estimated 

r r 

i j = l i j=l 

r 

* i=i 

r 
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